Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT

Authors

  • Yining Wang
  • Long Zhou
  • Jiajun Zhang
  • Chengqing Zong
Abstract

Neural machine translation (NMT), a new approach to machine translation, has been shown to outperform conventional statistical machine translation (SMT) across a variety of language pairs. Translation is an open-vocabulary problem, but most existing NMT systems operate with a fixed vocabulary, which leaves them unable to translate rare words. This problem can be alleviated by using different translation granularities, such as character, subword and hybrid word-character. Translation involving Chinese is one of the most difficult tasks in machine translation. However, to the best of our knowledge, no other work has explored which translation granularity is most suitable for Chinese in NMT. In this paper, we conduct an extensive comparison using Chinese-English NMT as a case study. Furthermore, we discuss the advantages and disadvantages of the various translation granularities in detail. Our experiments show that the subword model performs best for Chinese-to-English translation when the vocabulary is not too large, while the hybrid word-character model is most suitable for English-to-Chinese translation. Moreover, experiments with different granularities show that the Hybrid BPE method achieves the best result on the Chinese-to-English translation task.
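The granularities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<unk>` and `</w>` symbols, and the `<B>`/`<M>`/`<E>` position tags used for the hybrid word-character scheme are assumptions chosen for illustration, and the input is assumed to be already word-segmented.

```python
def segment_word(tokens, vocab, unk="<unk>"):
    """Word level: any token outside the fixed vocabulary collapses to <unk>."""
    return [t if t in vocab else unk for t in tokens]


def segment_char(tokens, boundary="</w>"):
    """Character level: every token is split into characters.

    A boundary symbol (an assumption here) is emitted after each word so
    that the original word boundaries can be restored.
    """
    out = []
    for t in tokens:
        out.extend(list(t))
        out.append(boundary)
    return out


def segment_hybrid(tokens, vocab):
    """Hybrid word-character: in-vocabulary words stay whole, while rare
    words are split into characters tagged by position (hypothetical tags)."""
    out = []
    for t in tokens:
        if t in vocab:
            out.append(t)
            continue
        chars = list(t)
        if len(chars) == 1:
            out.append("<B>" + chars[0])
            continue
        out.append("<B>" + chars[0])
        out.extend("<M>" + c for c in chars[1:-1])
        out.append("<E>" + chars[-1])
    return out


if __name__ == "__main__":
    vocab = {"我", "喜欢"}
    tokens = ["我", "喜欢", "机器翻译"]  # hypothetical pre-segmented sentence
    print(segment_word(tokens, vocab))
    print(segment_char(tokens))
    print(segment_hybrid(tokens, vocab))
```

The word-level output loses the rare word entirely, whereas the character and hybrid outputs keep it recoverable at the cost of longer sequences, which is the trade-off the comparison in the paper is concerned with.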

Similar resources

University of Rochester WMT 2017 NMT System Submission

We describe the neural machine translation system submitted by the University of Rochester to the Chinese-English language pair for the WMT 2017 news translation task. We applied unsupervised word and subword segmentation techniques and deep learning in order to address (i) the word segmentation problem caused by the lack of delimiters between words and phrases in Chinese and (ii) the morpholog...


Neural Machine Translation of Rare Words with Subword Units

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as seq...


A Character-Aware Encoder for Neural Machine Translation

This article proposes a novel character-aware neural machine translation (NMT) model that views the input sequences as sequences of characters rather than words. By using row convolution (Amodei et al., 2015), the encoder of the proposed model automatically composes word-level information from the input sequences of characters. Since our model doesn’t rely on the boundaries between each wo...


Literature Survey: Study of Neural Machine Translation

We build Neural Machine Translation (NMT) systems for English-Hindi, Bengali-Hindi and Gujarati-Hindi with two different units of translation, i.e. word and subword, and present a comparative study of subword NMT and word-level NMT systems, along with strong results and case studies. We train an attention-based encoder-decoder model for the word level and use Byte Pair Encoding (BPE) in subword NMT for wo...
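As an illustration of the Byte Pair Encoding step mentioned above, the following is a minimal sketch of learning and applying merge operations on a toy word-frequency table. It assumes the common formulation in which words are split into characters plus an end-of-word symbol and the most frequent adjacent pair is merged repeatedly. The function names, the `</w>` symbol, the toy data and the tie-breaking rule are assumptions, not details taken from any of the listed papers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    a, b = pair
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> count table."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken lexicographically so the result is deterministic
        # (an arbitrary choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            key = tuple(merge_pair(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges = learn_bpe(toy_counts, 3)
    print(merges)
    print(apply_bpe("lowest", merges))
```

The number of merges controls the subword vocabulary size: with few merges the segmentation stays close to characters, and with many merges frequent words are kept whole.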


Pre-Reordering for Neural Machine Translation: Helpful or Harmful?

Pre-reordering, a preprocessing step that makes the source-side word order close to that of the target side, has proven very helpful for improving translation quality in statistical machine translation (SMT). However, is this also the case in neural machine translation (NMT)? In this paper, we firstly investigate the impact of pre-reordered source-side data on NMT, and then propose to incorporate featur...


Journal:
  • CoRR

Volume: abs/1711.04457

Publication date: 2017